I have downloaded the dataset from https://www.cbioportal.org/study/summary?id=brca_tcga_pan_can_atlas_2018, unzipped it manually (there was a nested archive inside it, so unzipped that one too). Then I copied the 3 files I need for analysis into the folder Data in my R project folder.
Reading the files. RNA sequences file takes time as it contains around 20,000 lines with ~ 1,000 variables in each line.
# patients clinical data
data_patient = read.delim("Data/data_clinical_patient.txt")
# file with Copy Number Aberrations
data_cna = read.delim("Data/data_cna.txt")
# file with RNA sequences
data_rna_seq = read.delim("Data/data_mrna_seq_v2_rsem.txt")
Top match Patient Data IDs with RNA-seq and CNA identifiers, first looking at the file which provides mapping of a patient clinical info to a sample:
data_mapping = read.delim("Data/data_clinical_sample.txt")
# show the mapping
data_mapping
As can be seen from the mapping, identifiers to uniquely specify a patient follow the format of “TCGA-3C-AAAU” and the corresponding sample identifiers have “-01” at the end, e.g., has a format of “TCGA-3C-AAAU-01”.
The patients clinical data has the following columns:
# inspect column names of the data_patient data frame
colnames(data_patient)
[1] "X.Patient.Identifier"
[2] "Subtype"
[3] "TCGA.PanCanAtlas.Cancer.Type.Acronym"
[4] "Other.Patient.ID"
[5] "Diagnosis.Age"
[6] "Sex"
[7] "Neoplasm.Disease.Stage.American.Joint.Committee.on.Cancer.Code"
[8] "American.Joint.Committee.on.Cancer.Publication.Version.Type"
[9] "Last.Communication.Contact.from.Initial.Pathologic.Diagnosis.Date"
[10] "Birth.from.Initial.Pathologic.Diagnosis.Date"
[11] "Last.Alive.Less.Initial.Pathologic.Diagnosis.Date.Calculated.Day.Value"
[12] "Ethnicity.Category"
[13] "Form.completion.date"
[14] "Neoadjuvant.Therapy.Type.Administered.Prior.To.Resection.Text"
[15] "ICD.10.Classification"
[16] "International.Classification.of.Diseases.for.Oncology..Third.Edition.ICD.O.3.Histology.Code"
[17] "International.Classification.of.Diseases.for.Oncology..Third.Edition.ICD.O.3.Site.Code"
[18] "Informed.consent.verified"
[19] "New.Neoplasm.Event.Post.Initial.Therapy.Indicator"
[20] "American.Joint.Committee.on.Cancer.Metastasis.Stage.Code"
[21] "Neoplasm.Disease.Lymph.Node.Stage.American.Joint.Committee.on.Cancer.Code"
[22] "American.Joint.Committee.on.Cancer.Tumor.Stage.Code"
[23] "Person.Neoplasm.Cancer.Status"
[24] "Primary.Lymph.Node.Presentation.Assessment"
[25] "Prior.Diagnosis"
[26] "Race.Category"
[27] "Radiation.Therapy"
[28] "Patient.Weight"
[29] "In.PanCan.Pathway.Analysis"
[30] "Overall.Survival.Status"
[31] "Overall.Survival..Months."
[32] "Disease.specific.Survival.status"
[33] "Months.of.disease.specific.survival"
[34] "Disease.Free.Status"
[35] "Disease.Free..Months."
[36] "Progression.Free.Status"
[37] "Progress.Free.Survival..Months."
[38] "Genetic.Ancestry.Label"
The patient ID is the first column named
X.Patient.Identifier. Example values of the
X.Patient.Identifier column are:
# rows from 6 to 10 of the column with Patient ID
data_patient$X.Patient.Identifier[6:10]
[1] "TCGA-3C-AALI" "TCGA-3C-AALJ" "TCGA-3C-AALK" "TCGA-4H-AAAK" "TCGA-5L-AAT0"
Inspecting column names of the Copy Number Alteration data_cna data frame. The first 10 columns only, since the dataframe has more than 1,000 columns:
str(data_cna, list.len = 10)
'data.frame': 25128 obs. of 1072 variables:
$ Hugo_Symbol : chr "ACAP3" "ACTRT2" "AGRN" "ANKRD65" ...
$ Entrez_Gene_Id : int 116983 140625 375790 441869 55210 83858 219293 54998 126792 54991 ...
$ TCGA.3C.AAAU.01: int 0 0 0 0 0 0 0 0 0 0 ...
$ TCGA.3C.AALI.01: int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ TCGA.3C.AALJ.01: int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ TCGA.3C.AALK.01: int 0 0 0 0 0 0 0 0 0 0 ...
$ TCGA.4H.AAAK.01: int 0 0 0 0 0 0 0 0 0 0 ...
$ TCGA.5L.AAT0.01: int 0 0 0 0 0 0 0 0 0 0 ...
$ TCGA.5T.A9QA.01: int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ TCGA.A1.A0SB.01: int 0 0 0 0 0 0 0 0 0 0 ...
[list output truncated]
The output of the str() function showed that in the Copy
Number Alterations (CNA) dataset, the first two columns are genes, and
the rest columns are Sample IDs. Each Sample ID in CNA dataset
can be calculated from the Patient ID by replacing “-” with “.”
and adding “01”.
Inspecting the RNA-seq dataframe - the first 10 columns since there are more than 1,000 (one per patient):
str(data_rna_seq, list.len = 10)
'data.frame': 20531 obs. of 1084 variables:
$ Hugo_Symbol : chr "" "" "UBE2Q2P2" "HMGB1P1" ...
$ Entrez_Gene_Id : int 100130426 100133144 100134869 10357 10431 136542 155060 26823 280660 317712 ...
$ TCGA.3C.AAAU.01: num 0 16.4 12.9 52.2 408.1 ...
$ TCGA.3C.AALI.01: num 0 9.27 17.38 69.76 563.89 ...
$ TCGA.3C.AALJ.01: num 0.907 11.623 9.229 154.297 1360.83 ...
$ TCGA.3C.AALK.01: num 0 12.1 11.1 143.9 865.5 ...
$ TCGA.4H.AAAK.01: num 0 6.85 14.43 84.21 766.38 ...
$ TCGA.5L.AAT0.01: num 0 3.99 13.61 114.26 807.74 ...
$ TCGA.5T.A9QA.01: num 0 1.46 9 107.56 1420.5 ...
$ TCGA.A1.A0SB.01: num 0 15.3 14.4 116.4 657.3 ...
[list output truncated]
The output of the str() function showed that in the
RNA-seq dataset, the first two columns are also genes, and remaining
columns are Sample IDs.
Same as was observed with the CNA dataset, Sample ID in RNA-seq file can be calculated from the Patient ID by replacing “-” with “.” and adding “01”.
Metadata is a matrix of the following size: number of rows = number of samples in RNA assay, number of columns = 1.
Element value will be either 1 (ERBB2 Amplified) or 0 (Not Amplified).
# substract all columns with Samples from RNA-seq dataset
# they are all columns except the first two
# and convert into matrix
rna_assay = as.matrix(data_rna_seq[,-c(1,2)])
# give name for rows
rownames(rna_assay) = data_rna_seq[,1]
# check size of rna_assay
print("Size of rna_assay:")
[1] "Size of rna_assay:"
dim(rna_assay)
[1] 20531 1082
# metadata is a matrix of the following size:
# number of rows = number of samples in rna_assay
# number of columns = 1
# initialise with 0
metadataERBB2 = matrix(0, dim(rna_assay)[2], 1)
# size of the metadata matrix
print("Size of metadata matrix:")
[1] "Size of metadata matrix:"
dim(metadataERBB2)
[1] 1082 1
Logic to fill in the metadata matrix:
# iterate over each sample in the RNA assay and
# take a sample ID
# and use it to retrieve an element in the CNA dataset at the position:
# row = row in which the Hugo_Symbol is equal to ERBB2 gene
# column = column which has the name equal to the current sample ID
for (i in 1:dim(rna_assay)[2]){
# each column name in RNA assay is a Sample ID
sample_id = colnames(rna_assay)[i]
value = data_cna[data_cna$Hugo_Symbol == "ERBB2", sample_id]
if (length(value) > 0 && !is.na(value)){
if (value > 0) amplified = 1
else amplified = 0
# write the value into the metadata matrix
metadataERBB2[i,1] = amplified
}
}
# give the single column of the metadata matrix a name
colnames(metadataERBB2)[1] = "Amplified"
# give names to rows of the metadata matrix, which are Sample IDs
# they should be equal to column names of the RNA assay (Sample IDs)
rownames(metadataERBB2) = colnames(rna_assay)
View the metadata matrix:
print(head(metadataERBB2,10))
Amplified
TCGA.3C.AAAU.01 0
TCGA.3C.AALI.01 1
TCGA.3C.AALJ.01 1
TCGA.3C.AALK.01 1
TCGA.4H.AAAK.01 1
TCGA.5L.AAT0.01 0
TCGA.5T.A9QA.01 1
TCGA.A1.A0SB.01 0
TCGA.A1.A0SD.01 0
TCGA.A1.A0SE.01 0
It can be seen that the matrix has row names equal to Sample IDs, and a single column “Amplified” containing 0 or 1.
Checking that there are no NA in the metadata. There shouldn’t be any NA because the matrix was initialised with 0.
print(all(!is.na(metadataERBB2)))
[1] TRUE
Installing packages for Differential Expression Genes (DEG) analysis:
if (!require("BiocManager", quietly = TRUE))
suppressMessages(suppressWarnings(install.packages("BiocManager")))
# Install DeSeq2
suppressMessages(suppressWarnings(BiocManager::install("DESeq2")))
suppressMessages(library(DESeq2))
For DEG analysis, need to construct an object of the type
DESeqDataSet which contains input values, intermediate
calculations and results of DEG analysis. I will use the
DESeqDataSetFromMatrix function from the
DESeq2 package.
Data preparation:
# round the values in rna_assay,
# because DESeq expects input as counts as are integer numbers
rna_assay = round(rna_assay)
rna_assay[is.na(rna_assay)] = 0 # replace NA with zeros
rna_assay[rna_assay < 0] = 0 # replace negative values with zeros
# filter out genes with too many missing values
# keep the gene (row) only if it has at least 3 samples with value > 10
smallestGroupSize = 3
# vector that contains TRUE or FALSE for each gene
keep_gene = rowSums(rna_assay >= 10) >= smallestGroupSize
# keep rows only if keep_gene = TRUE for that row
rna_assay = rna_assay[keep_gene,]
Checking the size after rows with many missing values were removed:
print(
paste("Size of the RNA assay after removing genes with low counts: ",
paste(dim(rna_assay), collapse = " x ")))
[1] "Size of the RNA assay after removing genes with low counts: 18603 x 1082"
It can be seen that around 2,000 genes were removed and 18,603 genes were kept.
Creating the DESeqDataSet object, which contains inputs,
intermediate calculation, and result of DEG analysis:
# create DESeqDataSet instance
dds = DESeqDataSetFromMatrix(
countData = rna_assay, # input with counts for genes
colData = metadataERBB2, # colData rows must match countData columns
design = ~ Amplified) # formula how the counts for each gene depend on colData
converting counts to integer mode
Warning in DESeqDataSet(se, design = design, ignoreRank) :
13 duplicate rownames were renamed by adding numbers
the design formula contains one or more numeric variables with integer values,
specifying a model with increasing fold change for higher values.
did you mean for this to be a factor? if so, first convert
this variable to a factor using the factor() function
The counts that I provided in the rna_assay are not normalized, and
this is expected by the DESeq() function, based on the
documentation at bioconducter.org.
The DESEq() function performs estimation of size
factors, estimation of dispersion, Negative Binomial GLM fitting and
Wald statistics.
# Run differential expression analysis
dds <- DESeq(dds)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 3023 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
Names of the estimated effects (coefficients) of the model:
resultsNames(dds)
[1] "Intercept" "Amplified"
To access results of DEG analysis, calling the results()
function:
res = results(dds)
Top 10 most differentially expressed genes ordered by p-adjusted value:
# order by p-adjusted value
top_10_genes_padj = res[order(res$padj)[1:10],]
# print
print(top_10_genes_padj)
log2 fold change (MLE): Amplified
Wald test p-value: Amplified
DataFrame with 10 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
PSMD3 5059.91 2.04226 0.0579065 35.2683 1.79889e-272 3.34647e-268
STARD3 1694.82 2.18348 0.0625963 34.8819 1.39645e-266 1.29891e-262
ERBB2 18345.66 2.77945 0.0797532 34.8506 4.16245e-266 2.58113e-262
C17orf37 1959.93 2.42679 0.0703565 34.4928 1.02688e-260 4.77575e-257
GRB7 1308.10 2.59843 0.0789146 32.9271 9.00268e-238 3.34954e-234
PGAP3 1917.73 2.28601 0.0708165 32.2807 1.30456e-228 4.04479e-225
ORMDL3 3966.57 2.06060 0.0658508 31.2919 6.00605e-215 1.59615e-211
MED1 2541.95 1.81671 0.0601066 30.2248 1.11900e-200 2.60210e-197
MSL1 3083.81 1.39581 0.0494692 28.2158 3.74688e-175 7.74480e-172
CDK12 2051.81 1.47018 0.0523967 28.0587 3.12751e-173 5.81811e-170
Top 10 upregulated genes based on the fold change:
# order by log2 fold change
# when using descending order, log2 > 0
# and so returns top 10 upregulated genes
top_10_genes_log2fc_up = res[order(res$log2FoldChange, decreasing = TRUE)[1:10],]
print(top_10_genes_log2fc_up)
log2 fold change (MLE): Amplified
Wald test p-value: Amplified
DataFrame with 10 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
SPANXA2 2.31235 4.35987 0.633290 6.88447 5.80022e-12 1.01221e-10
GAGE12D 10.12361 4.32213 0.792895 5.45108 5.00657e-08 4.09215e-07
SPANXC 2.41072 4.13850 0.521456 7.93644 2.08071e-15 6.83879e-14
GAGE2B 1.50314 4.06720 1.417492 2.86929 4.11391e-03 9.95721e-03
FAM9C 1.67308 3.39339 0.297496 11.40651 3.87983e-30 5.50965e-28
GAGE4 4.39471 3.33288 0.672714 4.95438 7.25601e-07 4.55881e-06
PNMT 163.64015 3.29807 0.178970 18.42806 7.82327e-76 5.59755e-73
KRT20 3.49946 3.06552 0.434159 7.06083 1.65515e-12 3.23433e-11
TBX10 8.58797 3.00122 0.398242 7.53619 4.83907e-14 1.26434e-12
GAGE2D 4.03952 2.95668 0.751347 3.93518 8.31351e-05 3.15812e-04
Top 10 downregulated genes based on the fold change:
# order by log2 fold change
# when using ascending order, log2 < 0
# and so returns top 10 downregulated genes
top_10_genes_log2fc_down = res[order(res$log2FoldChange, decreasing = FALSE)[1:10],]
print(top_10_genes_log2fc_down)
log2 fold change (MLE): Amplified
Wald test p-value: Amplified
DataFrame with 10 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
CSN2 18.33147 -4.54022 0.827808 -5.48463 4.14340e-08 3.44567e-07
SMR3B 60.73650 -3.76811 0.421180 -8.94655 3.66770e-19 1.91121e-17
CSN3 47.97724 -3.72842 0.489189 -7.62164 2.50471e-14 6.87244e-13
LALBA 26.22764 -3.49553 0.721085 -4.84761 1.24960e-06 7.41982e-06
LACRT 30.24219 -3.23241 0.506075 -6.38722 1.68932e-10 2.29390e-09
NTS 43.91369 -3.16044 0.307213 -10.28746 8.02694e-25 7.77735e-23
CARTPT 302.97827 -3.02651 0.557301 -5.43067 5.61435e-08 4.52922e-07
RLBP1 3.39960 -3.01570 0.409080 -7.37191 1.68201e-13 3.99113e-12
UCP1 6.78740 -3.00797 0.313735 -9.58759 9.01657e-22 6.40211e-20
CSN1S1 4.61509 -2.85141 0.456347 -6.24834 4.14836e-10 5.15167e-09
Required packages:
if (!requireNamespace("clusterProfiler", quietly = TRUE))
BiocManager::install("clusterProfiler", quit = TRUE)
if (!requireNamespace("org.Hs.eg.db", quietly = TRUE))
BiocManager::install("org.Hs.eg.db")
if (!requireNamespace("enrichplot", quietly = TRUE))
install.packages("enrichplot")
library(clusterProfiler)
library(enrichplot)
library(org.Hs.eg.db)
Using differential expressed genes, which were obtained by DESeq() function, with p-adjusted < 0.05:
# filter DESEq results to keep
# differentially expressed genes with p-adjusted < 0.05
res_significant = res[res$padj<0.05,]
# separate into over- and underexpressed by log2 fold change
# overexpressed
res_over = rownames(res_significant[res_significant$log2FoldChange > 0,])
# underexpressed
res_under = rownames(res_significant[res_significant$log2FoldChange < 0,])
Gene Ontology (GO) enrichment analysis:
# GO enrichment analysis
go_results_over = enrichGO(
gene = res_over, # overexpressed genes list
OrgDb = org.Hs.eg.db, # humans
keyType = "SYMBOL",
ont = "BP", # biological processes
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05
)
Enriched biological processes associated with the overexpressed genes:
# print top 6 results
print(head(go_results_over[,1-10]))
Plot of enriched biological processes:
dotplot(go_results_over, showCategory = 10) +
ggtitle("Gene Ontology Enrichment - overexpressed")
GO analysis of the underexpressed genes:
go_results_under = enrichGO(
gene = res_under, # under expressed genes list
OrgDb = org.Hs.eg.db, # humans
keyType = "SYMBOL",
ont = "BP", # biological processes
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05
)
Enriched biological processes associated with the underexpressed genes:
# print top 6 results
print(head(go_results_under))
Plot of the enriched pathways related to the underexpressed genes:
dotplot(go_results_under, showCategory = 10) +
ggtitle("Gene Ontology Enrichment - Underexpressed")
Installing packages for pathway enrichment analysis using Reactome:
if (!requireNamespace("pathview", quietly = TRUE))
BiocManager::install("pathview")
if (!requireNamespace("ReactomePA", quietly = TRUE))
BiocManager::install("ReactomePA")
library(ReactomePA)
library(pathview)
KEGG (Kyoto Encyclopedia of Genes and Genomes) enrichment analysis:
# map genes codes to entrez - required to use Reactome and Kegg functions
gene_entrez_over = bitr(
res_over,
fromType = "SYMBOL",
toType = "ENTREZID",
OrgDb = org.Hs.eg.db
)
gene_entrez_under = bitr(
res_under,
fromType = "SYMBOL",
toType = "ENTREZID",
OrgDb = org.Hs.eg.db
)
kegg_results_over = enrichKEGG(
gene = gene_entrez_over[,2],
organism = "human",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05
)
kegg_results_under = enrichKEGG(
gene = gene_entrez_under[,2],
organism = "human",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05
)
Significantly enriched KEGG pathways or functional categories for most overexpressed genes:
print(head(kegg_results_over))
dotplot(kegg_results_over, showCategory = 10) + ggtitle("Kegg - overexpressed")
Significantly enriched KEGG pathways or functional categories for most underxpressed genes:
print(head(kegg_results_under))
dotplot(kegg_results_under, showCategory = 10) + ggtitle("Kegg - underexpressed")
Reactome results.
reactome_results_over = enrichPathway(
gene = gene_entrez_over[,2],
organism = "human",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
)
reactome_results_under = enrichPathway(
gene = gene_entrez_under[,2],
organism = "human",
pAdjustMethod = "BH",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
)
Pathways associated with overexpressed genes:
print(head(reactome_results_over))
dotplot(reactome_results_over, showCategory = 10) + ggtitle("Reactome - overxpressed")
Pathways associated with underexpressed genes:
print(head(reactome_results_under))
dotplot(reactome_results_under, showCategory=10) + ggtitle("Reactome - underexpressed")
Applying a variance stabilizing transformation (VST) to the RNA-seq count data. VST automatically normalizes the count data, providing matrix of values which are approximately homoskedastic (have constant variance along the range of mean values).
# variance stabilizing transformation
vst = vst(dds)
PCA plot:
# PCA plot of the vst results grouped by the Amplified status
plotPCA(vst, intgroup=c("Amplified"))
using ntop=500 top features by variance
Pacakges for Heatmap:
# install packages for heatmap
if (!requireNamespace("pheatmap", quietly = TRUE))
install.packages("pheatmap")
library(pheatmap)
Subset of the most differentially expressed genes to visualize:
# sort by p-adjusted
genes_sorted = order(res$padj)
# VST results for top 20 genes
vst_top20 = assay(vst)[genes_sorted[1:20],]
Plotting the heatmap:
# column for annotation
annotation_column = as.data.frame(colData(dds)[, "Amplified", drop = FALSE])
# colors for annotation
# annotation_colors = list(c("High" = "darkred", "Low" = "lightblue"))
pheatmap(
vst_top20,
cluster_rows = TRUE,
cluster_cols = TRUE,
scale = 'row',
show_colnames = FALSE,
show_rownames = TRUE,
annotation_col = annotation_column,
main = "Heatmap of Gene Expression by ERBB2 Amplified status")
Packages:
library(survminer)
Loading required package: ggplot2
Loading required package: ggpubr
Attaching package: ‘ggpubr’
The following object is masked from ‘package:enrichplot’:
color_palette
Attaching package: ‘survminer’
The following object is masked from ‘package:survival’:
myeloma
Data preparation: extracting survival times (non-zero and non NA), event status and the corresponding subset of the VST matrix. VST matrix is subset based on correspondence of Sample ID and Patient ID from a clinical info data:
# vst-transformed matrix
vst_matrix = assay(vst)
# transform patient identifier to match sample IDs in vst_matrix
# as a result, transforms TCGA-3C-AAAU to TCGA.3C.AAAU
sample_ids = gsub("-", ".", data_patient$X.Patient.Identifier[5:nrow(data_patient)])
# extract sample ids from vst without suffix .01
sample_ids_vst = substr(colnames(vst_matrix), 1, 12)
# convert survival_time and event_status to numeric
survival_time = as.numeric(data_patient$Overall.Survival..Months.[5:nrow(data_patient)])
event_status = data_patient$Overall.Survival.Status[5:nrow(data_patient)]
event_status = ifelse(event_status == "1:DECEASED", 1, 0)
# filter out survival time that are less than equal 0
valid_index = which(survival_time > 0)
survival_time = survival_time[valid_index]
event_status = event_status[valid_index]
sample_ids = sample_ids[valid_index]
# remove rows with NA values
complete_cases_index = complete.cases(survival_time, event_status)
survival_time = survival_time[complete_cases_index]
event_status = event_status[complete_cases_index]
sample_ids = sample_ids[complete_cases_index]
# match sample IDs derived from Patient IDs and VST sample IDs
# keep those that present in both
matching_ids = intersect(sample_ids, sample_ids_vst)
# their indexes
matching_index = which(sample_ids %in% matching_ids)
# subset survival_time and event_status by indexes of matching ids
survival_time = survival_time[matching_index]
event_status = event_status[matching_index]
# subset vst_matrix by matching ids
matching_ids_vst = colnames(vst_matrix)[substr(colnames(vst_matrix), 1, 12) %in% matching_ids]
vst_matrix = vst_matrix[, matching_ids_vst]
Transpose vst_matrix so that samples are rows, genes are columns:
vst_matrix = t(vst_matrix)
Check no NA:
sum(is.na(survival_time))
[1] 0
sum(is.na(event_status))
[1] 0
sum(is.na(vst_matrix))
[1] 0
Create survival object:
survival_object = Surv(time = survival_time, event = event_status)
Cox model with alpha = 1 to apply lasso penalty:
# Cox model and Lasso regularization using glmnet
cox_model = glmnet(vst_matrix,
survival_object,
family = "cox",
alpha = 1) # lasso penalty
K-fold (5) cross-validation for glmnet:
# produces a plot and returns a value for lambda
cv_fit = cv.glmnet(vst_matrix,
survival_object,
family = "cox",
nfolds = 5) # reduced to 5 due to slow execution
# best lambda
best_lambda = cv_fit$lambda.min
print(best_lambda)
[1] 0.03949631
Survival risk:
# survival risk
risk_score = predict(cox_model,
s = best_lambda,
newx = vst_matrix)
print(head(risk_score))
1
TCGA.3C.AAAU.01 1.972002
TCGA.3C.AALI.01 2.145560
TCGA.3C.AALJ.01 2.189554
TCGA.3C.AALK.01 2.010798
TCGA.4H.AAAK.01 2.027707
TCGA.5L.AAT0.01 1.930621
Plot data: